Analysis of STD Infection Rates in US: 1996-2011

Introduction

The dataset is taken from https://wonder.cdc.gov. It covers STD infection rates in the United States from 1996 to 2011 for three diseases: Chlamydia, Gonorrhea, and Primary and Secondary Syphilis. Rates are expressed per 100,000 population. The analysis gives an idea of the rise and fall of these STDs in America. According to the website, there are no records for New York from 1996 to 1999. The citation is as follows:

US Department of Health and Human Services, Centers for Disease Control and Prevention, National Center for HIV, STD and TB Prevention (NCHSTP), Division of STD/HIV Prevention, Sexually Transmitted Disease Morbidity for selected STDs by age, race/ethnicity and gender 1996-2011 Archive, CDC WONDER Online Database.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

%matplotlib inline
sb.set_style('darkgrid')
#sb.set_palettet(sb.color_palette()[0])
In [2]:
import plotly.express as px
In [3]:
df = pd.read_csv('STDs by Age,Year,State and Gender, 1996-2011 Archive.csv')
df.head()
Out[3]:
     Disease  Disease Code    State  State Code  Year  Year Code  Gender  Gender Code          Age  Age Code  STD Cases  Population     Rate
0  Chlamydia           274  Alabama           1  1996       1996  Female            F   0-14 years      0-14        126      446376    28.23
1  Chlamydia           274  Alabama           1  1996       1996  Female            F  15-19 years     15-19       1802      160943  1119.65
2  Chlamydia           274  Alabama           1  1996       1996  Female            F  20-24 years     20-24       1405      156405   898.31
3  Chlamydia           274  Alabama           1  1996       1996  Female            F  25-29 years     25-29        405      156680   258.49
4  Chlamydia           274  Alabama           1  1996       1996  Female            F  30-34 years     30-34        144      161922    88.93
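The first rows make it possible to check how the rate column relates to the other two numeric columns. The sketch below assumes the rate is simply cases divided by population, scaled to 100,000 and rounded to two decimals; the values are copied from the df.head() output, and the names `sample`, `rate_check`, and `matches` are only illustrative.

```python
import pandas as pd

# First five rows of the dataset, copied from the df.head() output.
sample = pd.DataFrame({
    "std_cases": [126, 1802, 1405, 405, 144],
    "population": [446376, 160943, 156405, 156680, 161922],
    "rate": [28.23, 1119.65, 898.31, 258.49, 88.93],
})

# Assumed formula: cases per 100,000 population, rounded to two decimals.
sample["rate_check"] = (
    sample["std_cases"] / sample["population"] * 100_000
).round(2)

# True when every recomputed value agrees with the published rate
# to within rounding error.
matches = bool(((sample["rate_check"] - sample["rate"]).abs() <= 0.011).all())

print(sample)
print("rates reproduced:", matches)
```

If the check holds on the full data frame as well, the rate column can be treated as a row-level quantity derived from std_cases and population.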
In [4]:
df.tail()
Out[4]:
                              Disease  Disease Code      State  State Code  Year  Year Code  Gender  Gender Code          Age  Age Code  STD Cases  Population  Rate
27048  Primary and Secondary Syphilis           310  Wisconsin          55  2011       2011    Male            M  20-24 years     20-24         14      196897  7.11
27049  Primary and Secondary Syphilis           310  Wisconsin          55  2011       2011    Male            M  25-29 years     25-29         10      189349  5.28
27050  Primary and Secondary Syphilis           310  Wisconsin          55  2011       2011    Male            M  30-34 years     30-34          4      178120  2.25
27051  Primary and Secondary Syphilis           310  Wisconsin          55  2011       2011    Male            M  35-39 years     35-39          5      174619  2.86
27052  Primary and Secondary Syphilis           310  Wisconsin          55  2011       2011    Male            M    40+ years       40+         18     1314703  1.37
In [5]:
df.shape
Out[5]:
(27053, 13)
In [6]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27053 entries, 0 to 27052
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Disease       27053 non-null  object 
 1   Disease Code  27053 non-null  int64  
 2   State         27053 non-null  object 
 3   State Code    27053 non-null  int64  
 4   Year          27053 non-null  int64  
 5   Year Code     27053 non-null  int64  
 6   Gender        27053 non-null  object 
 7   Gender Code   27053 non-null  object 
 8   Age           27053 non-null  object 
 9   Age Code      27053 non-null  object 
 10  STD Cases     27053 non-null  int64  
 11  Population    27053 non-null  int64  
 12  Rate          27053 non-null  float64
dtypes: float64(1), int64(6), object(6)
memory usage: 2.7+ MB

Data Cleaning

In [7]:
# strip whitespace, lowercase the column labels, and replace spaces with underscores
df.rename(columns=lambda x: x.strip().lower().replace(" ", "_"),
             inplace=True)
df.head()
Out[7]:
     disease  disease_code    state  state_code  year  year_code  gender  gender_code          age  age_code  std_cases  population     rate
0  Chlamydia           274  Alabama           1  1996       1996  Female            F   0-14 years      0-14        126      446376    28.23
1  Chlamydia           274  Alabama           1  1996       1996  Female            F  15-19 years     15-19       1802      160943  1119.65
2  Chlamydia           274  Alabama           1  1996       1996  Female            F  20-24 years     20-24       1405      156405   898.31
3  Chlamydia           274  Alabama           1  1996       1996  Female            F  25-29 years     25-29        405      156680   258.49
4  Chlamydia           274  Alabama           1  1996       1996  Female            F  30-34 years     30-34        144      161922    88.93
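The renaming step can be tried in isolation on an empty frame. This is a small sketch with made-up column labels (one of them padded with spaces to show what strip() does); it returns a new frame rather than modifying in place.

```python
import pandas as pd

# Hypothetical labels, including one with stray leading/trailing spaces.
raw = pd.DataFrame(columns=["Disease Code", " STD Cases ", "Rate"])

# Same transformation as the cleaning cell: strip, lowercase, underscores.
cleaned = raw.rename(
    columns=lambda name: name.strip().lower().replace(" ", "_")
)

print(list(cleaned.columns))
```

Snake-case labels without spaces are easier to pass as strings to seaborn and plotly later on.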
In [8]:
df.duplicated().sum()
Out[8]:
0
In [9]:
df.isna().sum()
Out[9]:
disease         0
disease_code    0
state           0
state_code      0
year            0
year_code       0
gender          0
gender_code     0
age             0
age_code        0
std_cases       0
population      0
rate            0
dtype: int64
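isna() only finds empty cells in rows that exist; it cannot reveal rows that are absent altogether, such as the New York records for 1996 to 1999 mentioned in the introduction. One possible way to look for such gaps is to tabulate rows per state and year. The sketch below uses a hypothetical three-row frame (`StateA`, `StateB`) in place of the real data.

```python
import pandas as pd

# Toy stand-in for the cleaned frame; "StateB" has no rows for 1996.
toy = pd.DataFrame({
    "state": ["StateA", "StateA", "StateB"],
    "year": [1996, 1997, 1997],
    "std_cases": [10, 12, 7],
})

# Count rows for every state-year pair; pairs with no rows show up as 0.
coverage = pd.crosstab(toy["state"], toy["year"])

# Collect the (state, year) pairs that have no rows at all.
missing = [
    (state, int(year))
    for state, row in coverage.iterrows()
    for year, n_rows in row.items()
    if n_rows == 0
]

print(coverage)
print(missing)
```

On the real data the same crosstab would be built from df['state'] and df['year'].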

What is the structure of your dataset?

The dataset is fairly clean: it contains 27,053 records and 13 columns, with no duplicate rows and no missing values. The original column names are:
['Disease', 'Disease Code', 'State', 'State Code', 'Year', 'Year Code', 'Gender', 'Gender Code', 'Age', 'Age Code', 'STD Cases', 'Population', 'Rate']

What are the main features of interest in your dataset?

• STD Cases
• Rate
• Year
• Disease
• State
• Gender

What features in the dataset do you think will help support your investigation into your features of interest?

• Disease
• STD Cases
• Rate
• Year


Univariate Plots

What is the trend of STDs in the States?

In [10]:
plt.figure(figsize=(27,17))
# countplot tallies rows (records) per state, not the number of STD cases
sb.countplot(data=df, y='state',
             color=sb.color_palette()[0], order=df['state'].value_counts().index)
plt.title('Number of records per state',fontsize=20)
plt.xlabel('count',fontsize=17)
plt.yticks(fontsize=15)
plt.xticks(fontsize=15)
plt.ylabel('State',fontsize=17);

From the above visualization we can see that:

• Louisiana, Texas and Tennessee have the most records in the dataset
• Vermont and Wyoming have the fewest records
• These counts are numbers of rows (one per disease, year, gender and age group), not numbers of reported STD cases
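A count plot draws the same tally that value_counts() returns: the number of rows per category. The sketch below, on made-up numbers with hypothetical state names, shows how the row count and the total of std_cases can rank states differently.

```python
import pandas as pd

# Hypothetical data: StateA has many small rows, StateB one large row.
toy = pd.DataFrame({
    "state": ["StateA", "StateA", "StateA", "StateB"],
    "std_cases": [5, 3, 2, 500],
})

# What a count plot shows: number of rows per state.
rows_per_state = toy["state"].value_counts()

# Total reported cases per state.
cases_per_state = toy.groupby("state")["std_cases"].sum()

print(rows_per_state)
print(cases_per_state)
```

StateA leads on rows while StateB leads on cases, so the two views answer different questions.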

Which gender suffers more from STDs?

In [11]:
plt.figure(figsize=(10,7))

#plot
sb.countplot(data=df, x='gender',
             order=df['gender'].value_counts().index, color=sb.color_palette()[0]);

# setting title and labels
plt.title('Which gender suffers more from STDs?',fontsize=20)
plt.xticks(fontsize=15)
plt.xlabel('Gender',fontsize=15)
plt.yticks(fontsize=15)
plt.ylabel('Frequency',fontsize=15);

From above chart,

• Males account for more records in the dataset than Females. Because the plot counts rows rather than cases, this alone does not show that Males suffer more from STDs

How are STD rates distributed over the dataset?

In [12]:
df['rate'].describe()
Out[12]:
count    27053.000000
mean       260.536064
std        528.756490
min          0.090000
25%          8.770000
50%         56.210000
75%        255.840000
max       6982.070000
Name: rate, dtype: float64
In [13]:
np.log10(df['rate'].describe())
Out[13]:
count    4.432215
mean     2.415868
std      2.723256
min     -1.045757
25%      0.943000
50%      1.749814
75%      2.407968
max      3.843984
Name: rate, dtype: float64
In [14]:
plt.figure(figsize=(17,10))
plt.suptitle('Distribution of STD rates',fontsize=20)
#left histogram: data plotted on a linear scale
plt.subplot(1,2,1)
bins = np.arange(0,df['rate'].max() + 100,100)
plt.hist(data=df,x='rate',bins=bins);
plt.xlabel('values',fontsize=15)
plt.xticks(fontsize=15);
plt.yticks(fontsize=15);
#plt.title('Distribution of STD rates',fontsize=20)

#right histogram: data plotted after log transformation
plt.subplot(1,2,2)
bins = 10 ** np.arange(-1,4 + 0.1,0.1)
ticks = [0.1, 1, 3, 10, 30, 100, 300, 1000,5000]
plt.hist(df['rate'],bins=bins);
plt.xscale('log');
plt.xticks(ticks,ticks,fontsize=15)
plt.xlabel('values (log scale)',fontsize=15);
plt.yticks(fontsize=15);

From the above histograms,

Histogram on Left:

• A small number of data points with values above 500 stretch the x-axis, so most of the points are squeezed into the bins on the far left.
• This indicates that an axis transformation is needed to visualize the distribution of STD rates.

Histogram on Right:

The logarithmic scale transformation shows that the data is multimodal, with a first peak somewhere between 3 and 10, a second peak around 100, and the largest peak around 300. The rate distribution cuts off at its maximum, rather than declining in a smooth tail.
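The log-scaled histogram relies on bin edges that are evenly spaced in log10 units rather than in the original units. A minimal sketch of that idea, using synthetic right-skewed values in place of df['rate'] (the lognormal parameters are arbitrary assumptions):

```python
import numpy as np

# Synthetic right-skewed values standing in for df['rate'].
rng = np.random.default_rng(0)
rates = rng.lognormal(mean=4.0, sigma=2.0, size=1_000)

# Whole powers of ten that bracket the data.
lo = np.floor(np.log10(rates.min()))
hi = np.ceil(np.log10(rates.max()))

# Bin edges spaced 0.1 apart in log10 units, from 10**lo to 10**hi.
n_edges = int(round((hi - lo) * 10)) + 1
edges = np.logspace(lo, hi, n_edges)

# Count the values falling in each log-spaced bin.
counts, _ = np.histogram(rates, bins=edges)

print(len(edges), counts.sum())
```

Passing such edges to plt.hist together with plt.xscale('log') gives bars of equal visual width, which is what makes the separate peaks visible.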


Bivariate Plots

Which states have the highest number of STD cases?

In [15]:
case_counts = df['std_cases'].value_counts()
case_counts.index
Out[15]:
Int64Index([    3,     4,     5,     6,     7,     8,    10,     9,    11,
               12,
            ...
             1736,  1752,  5850,  7899,  1768,  1800, 10012,  1832,  5946,
             2137],
           dtype='int64', length=3209)
In [16]:
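# sum the numeric columns per state; only the std_cases and population totals are meaningful
uni_plt = df.groupby(['state']).sum(numeric_only=True)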
uni_plt.sort_values('std_cases',ascending=False,inplace=True)
uni_plt = uni_plt.reset_index('state')
uni_plt.head(10)
Out[16]:
            state  disease_code  state_code     year  year_code  std_cases  population       rate
0           Texas        185476       31008  1294254    1294254    1495910  1005846754  172034.69
1      California        183306        3834  1280239    1280239    1461879  1566844466  107672.69
2         Florida        183616        7680  1282240    1282240     946306   770167984  165967.15
3        Illinois        182996       10846  1278233    1278233     912044   562572144  197310.90
4  North Carolina        182996       23606  1278216    1278216     700211   382160611  226739.57
5        New York        175282       22068  1228191    1228191     641854   785144438   95128.28
6            Ohio        179586       24453  1256201    1256201     553881   506009309  137035.42
7    Pennsylvania        175866       25830  1232172    1232172     514592   544682798  119694.67
8         Georgia        183926        8333  1284240    1284240     503214   392677071  150135.03
9        Michigan        179896       16328  1258150    1258150     447758   441731864  125250.18
In [17]:
plt.figure(figsize=(17,10))
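# plot the per-state totals from uni_plt; barplot on df would show the mean cases per row instead
sb.barplot(data=uni_plt, x='state',y='std_cases',color=sb.color_palette()[0])
plt.title('Total number of STD Cases in the States')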
plt.xlabel('State')
plt.xticks(rotation=90)
plt.ylabel('STD Cases');

From the above barchart, we can see that:

• Texas and California have the highest number of STD cases
• Vermont and Wyoming have the lowest number of STD cases
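When seaborn's barplot receives row-level data, it aggregates y within each category using the mean by default, which is not the same as a total. The difference can be illustrated with plain pandas on hypothetical numbers:

```python
import pandas as pd

# Hypothetical data: StateA reports four small rows, StateB a single larger one.
toy = pd.DataFrame({
    "state": ["StateA", "StateA", "StateA", "StateA", "StateB"],
    "std_cases": [10, 10, 10, 10, 25],
})

# Mean per row: the kind of summary barplot draws by default.
mean_cases = toy.groupby("state")["std_cases"].mean()

# Total per state: the quantity needed to rank states by number of cases.
total_cases = toy.groupby("state")["std_cases"].sum()

print(mean_cases)
print(total_cases)
```

StateB has the higher mean, while StateA has the higher total, so a ranking by cases should be based on the summed frame.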

How do rates vary by state?

In [18]:
df['rate'].describe()
Out[18]:
count    27053.000000
mean       260.536064
std        528.756490
min          0.090000
25%          8.770000
50%         56.210000
75%        255.840000
max       6982.070000
Name: rate, dtype: float64
In [19]:
plt.figure(figsize=(17,10))
sb.barplot(data=df, x='state',y='rate',color=sb.color_palette()[0])
plt.title('STD rates in the States')
plt.xlabel('State')
plt.xticks(rotation=90)
plt.ylabel('Rate');

From the above visualization, one can observe that:

• Alaska and DC have higher average STD rates than the other states
• New Jersey has the lowest average rate
• Each bar is the mean of the row-level rates for a state over the whole period, so the chart does not show growth over time


Multivariate Plots

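Based on the bivariate plots, it looks like there is something wrong with the rate column. One way to examine this is through the standard deviation of the rate column. Let's group the data by year and disease with groupby() and store the result in a new data frame named 'actual_rate1'. Note that std() measures the spread of the row-level values within each year and disease; it is not a recalculated rate.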

What is the overall trend of STDs from 1996 to 2011?

In [20]:
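# standard deviation of each numeric column per year and disease;
# numeric_only=True (pandas 1.5+) skips the text columns
actual_rate1 = df.groupby(['year','disease']).std(numeric_only=True)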
actual_rate1.dropna(inplace=True)
In [21]:
actual_rate1 = actual_rate1.reset_index()
actual_rate1
Out[21]:
year disease disease_code state_code year_code std_cases population rate
0 1996 Chlamydia 0.0 15.543452 0.0 1173.871022 595429.537619 460.265207
1 1996 Gonorrhea 0.0 15.703611 0.0 562.335722 642293.733268 282.147477
2 1996 Primary and Secondary Syphilis 0.0 15.362291 0.0 27.162114 736912.067624 13.550793
3 1997 Chlamydia 0.0 15.559562 0.0 1258.874722 607545.997864 465.591104
4 1997 Gonorrhea 0.0 15.570287 0.0 556.349434 651066.909222 273.704937
5 1997 Primary and Secondary Syphilis 0.0 15.333483 0.0 23.188749 768987.310957 10.610993
6 1998 Chlamydia 0.0 15.656036 0.0 1418.409943 615165.127097 553.902357
7 1998 Gonorrhea 0.0 15.610511 0.0 649.619680 666970.643992 294.138330
8 1998 Primary and Secondary Syphilis 0.0 15.582441 0.0 19.153213 802754.132318 7.832609
9 1999 Chlamydia 0.0 15.661859 0.0 1614.579090 630425.212493 592.324453
10 1999 Gonorrhea 0.0 15.611055 0.0 681.911602 681357.007059 283.076018
11 1999 Primary and Secondary Syphilis 0.0 15.359317 0.0 18.098200 839925.280125 6.273269
12 2000 Chlamydia 0.0 15.666733 0.0 1737.351671 675902.561734 607.361566
13 2000 Gonorrhea 0.0 15.711072 0.0 685.296339 692787.396656 262.707658
14 2000 Primary and Secondary Syphilis 0.0 15.692655 0.0 17.444339 835424.573848 5.267350
15 2001 Chlamydia 0.0 15.680088 0.0 1841.660645 688898.018902 627.172652
16 2001 Gonorrhea 0.0 15.680784 0.0 654.413844 708405.413362 256.609949
17 2001 Primary and Secondary Syphilis 0.0 15.333264 0.0 21.628515 859622.016191 4.308148
18 2002 Chlamydia 0.0 15.680153 0.0 1966.045336 702610.743390 688.738400
19 2002 Gonorrhea 0.0 15.741738 0.0 634.262118 721380.081904 243.039217
20 2002 Primary and Secondary Syphilis 0.0 14.994349 0.0 34.546438 891438.161852 4.731427
21 2003 Chlamydia 0.0 15.664779 0.0 2113.191307 714970.490682 689.555149
22 2003 Gonorrhea 0.0 15.581085 0.0 606.288125 736193.119824 220.378573
23 2003 Primary and Secondary Syphilis 0.0 14.914133 0.0 41.322794 919135.949551 5.645833
24 2004 Chlamydia 0.0 15.607693 0.0 2175.894156 727268.487252 735.227636
25 2004 Gonorrhea 0.0 15.646352 0.0 598.458507 745514.396341 214.793022
26 2004 Primary and Secondary Syphilis 0.0 15.064218 0.0 45.075771 938700.262976 7.341664
27 2005 Chlamydia 0.0 15.687368 0.0 2257.311214 737411.391640 750.879841
28 2005 Gonorrhea 0.0 15.600251 0.0 622.345666 754743.823876 211.794028
29 2005 Primary and Secondary Syphilis 0.0 15.406943 0.0 49.632539 955140.277299 9.505743
30 2006 Chlamydia 0.0 15.689174 0.0 2323.857341 746898.525345 764.294502
31 2006 Gonorrhea 0.0 15.681503 0.0 668.628605 762497.149558 217.575694
32 2006 Primary and Secondary Syphilis 0.0 15.507576 0.0 53.162985 936346.649263 8.597310
33 2007 Chlamydia 0.0 15.668914 0.0 2509.793211 755421.242508 812.587034
34 2007 Gonorrhea 0.0 15.613505 0.0 687.093145 774489.586817 226.545681
35 2007 Primary and Secondary Syphilis 0.0 15.368993 0.0 60.733468 952093.433355 13.434676
36 2008 Chlamydia 0.0 15.663416 0.0 2763.050140 764776.414919 863.943825
37 2008 Gonorrhea 0.0 15.681613 0.0 681.666781 784301.296410 224.981776
38 2008 Primary and Secondary Syphilis 0.0 15.286277 0.0 65.758511 947463.460707 12.054088
39 2009 Chlamydia 0.0 15.708408 0.0 2852.563050 774765.641792 894.676099
40 2009 Gonorrhea 0.0 15.630304 0.0 633.217177 796204.139108 219.115391
41 2009 Primary and Secondary Syphilis 0.0 15.224054 0.0 64.796651 971541.090278 13.308597
42 2010 Chlamydia 0.0 15.683865 0.0 2997.534985 785585.742188 900.155039
43 2010 Gonorrhea 0.0 15.546582 0.0 663.521336 811244.389074 206.547339
44 2010 Primary and Secondary Syphilis 0.0 15.153316 0.0 66.143193 983784.206033 11.592261
45 2011 Chlamydia 0.0 15.654679 0.0 3168.175800 784694.446193 945.521047
46 2011 Gonorrhea 0.0 15.497336 0.0 668.960680 806579.238717 208.675047
47 2011 Primary and Secondary Syphilis 0.0 15.288464 0.0 73.033563 986810.307715 12.883141
In [22]:
plt.figure(figsize=(10,7))
sb.lineplot(x='year',y='rate',hue='disease',data=actual_rate1);
plt.title('Overall trend of STDs in US since 1996');
In [23]:
fig = px.line(actual_rate1, x='year', y='rate',color='disease')
fig.update_layout(
    title = "Overall trend of STDs in US since 1996"
)

fig.show()

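Based on the above visualization, we conclude that

• Overall, the Chlamydia series increases every year, reaching a peak of over 900 in 2011
• The Gonorrhea series declines overall, from about 294 in 1998 to about 207 in 2010
• Syphilis, on the other hand, stays comparatively flat over the period
• The plotted values are standard deviations of the row-level rates, so these trends describe the spread of the rates rather than the rates themselves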
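Since actual_rate1 is built with .std(), its rate column holds the standard deviation of the row-level rates. One alternative, sketched here on hypothetical numbers, is a pooled rate: sum the cases and the population within each year and disease, then divide. The name `pooled` and the column `rate_per_100k` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Hypothetical rows: two strata per year for one disease.
toy = pd.DataFrame({
    "year": [1996, 1996, 1997, 1997],
    "disease": ["Chlamydia"] * 4,
    "std_cases": [100, 300, 150, 450],
    "population": [100_000, 100_000, 100_000, 100_000],
})

# Sum cases and population within each year and disease.
pooled = toy.groupby(["year", "disease"], as_index=False)[
    ["std_cases", "population"]
].sum()

# Pooled rate per 100,000 population.
pooled["rate_per_100k"] = (
    pooled["std_cases"] / pooled["population"] * 100_000
)

print(pooled)
```

A frame shaped like `pooled` could be passed to sb.lineplot or px.line with y='rate_per_100k' and hue/color='disease' in the same way as actual_rate1.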

How are STDs spread across different age groups?

In [24]:
fig = px.histogram(df, x='age', y='rate',color='disease')
fig.update_layout(
    title = "Spread of STDs across age groups"
)

fig.show()

Results from the above histogram:

• The teenage years had the highest STD rates
• Among the STDs, Chlamydia was the most prevalent in most age groups, while Syphilis was the least
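When a y column is supplied, px.histogram adds up the y values in each bar by default, as far as the plotly documentation for histfunc describes it, so an age group with more rows can get a taller bar even if its typical rate is lower. A hedged way to cross-check the ranking is to compare summed and averaged rates in a pivot table; the numbers below are made up.

```python
import pandas as pd

# Hypothetical rows: the 20-24 group has more Chlamydia rows than 15-19.
toy = pd.DataFrame({
    "age": ["15-19 years", "15-19 years",
            "20-24 years", "20-24 years", "20-24 years"],
    "disease": ["Chlamydia", "Gonorrhea",
                "Chlamydia", "Chlamydia", "Gonorrhea"],
    "rate": [1000.0, 400.0, 600.0, 800.0, 500.0],
})

# Sum of row-level rates per age group and disease.
summed = toy.pivot_table(index="age", columns="disease",
                         values="rate", aggfunc="sum")

# Mean of row-level rates per age group and disease.
averaged = toy.pivot_table(index="age", columns="disease",
                           values="rate", aggfunc="mean")

print(summed)
print(averaged)
```

In this toy case the summed Chlamydia rate is highest for 20-24 years, while the averaged rate is highest for 15-19 years, so the choice of aggregation can change which age group appears to lead.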

Of the features you investigated, were there any unusual distributions? Did you perform any operations/transformations? If so, why did you do this?

• In the visualization of the STD rate, the histogram was strongly right-skewed, with most values piled up in the bins on the far left.
• This was addressed with a log scale transformation, which gave a clearer picture of the distribution of the STD rates

Were there any surprising interactions between features?

• In the bivariate plot of STD rate vs. state, the rate behaved unexpectedly. The multivariate plots attempted to address this by aggregating the rate column by year and disease, which gave results that were easier to interpret

Conclusion

From the visualizations above, the following conclusions can be drawn:

• Texas has the highest total number of STD cases (about 1.50 million)
• California follows closely (about 1.46 million)
• Overall, Chlamydia has increased over the years
• In the plotted series, Gonorrhea peaked in 1998 and was lower in every later year
• The rates across age groups are unimodal, with the 20-24 age group having the highest rate of STDs
• From the multivariate plots, Chlamydia was the most common of the three STDs
• States with few STD cases, such as Vermont and Wyoming, also have low rates, which suggests a relationship between STD cases and rate. However, the states with the highest numbers of STD cases do not follow this relationship, so further analysis is required.
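One possible starting point for that further analysis is a rank correlation between each state's total cases and its pooled rate. The sketch below uses four hypothetical states and computes a Spearman-style coefficient as the Pearson correlation of the ranks; none of these numbers come from the dataset.

```python
import pandas as pd

# Hypothetical per-state totals.
state_totals = pd.DataFrame({
    "state": ["StateA", "StateB", "StateC", "StateD"],
    "std_cases": [100, 200, 5_000, 8_000],
    "population": [100_000, 100_000, 10_000_000, 5_000_000],
})

# Pooled rate per 100,000 population for each state.
state_totals["rate_per_100k"] = (
    state_totals["std_cases"] / state_totals["population"] * 100_000
)

# Rank correlation: Pearson correlation computed on the ranks.
rho = state_totals["std_cases"].rank().corr(
    state_totals["rate_per_100k"].rank()
)

print(state_totals)
print("rank correlation:", rho)
```

A coefficient near 1 would support a monotonic relationship between cases and rate; a value near 0, as in this toy example, would not.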

Thank you, Udacity!